Lambda Architecture Re-architected - Batch Layer

The purpose of this notebook is to showcase a Lambda Architecture uusing Apache Spark and Azure Cosmos DB using publicly available Twitter data. This notebook will focus on the batch layer.

Configure HDI Spark to connect to Cosmos DB Jars

As noted by using configure, we specify the memory, cores, drivers, executors as well as the JARs we are using.

In [1]:
%%configure
{ "name":"lambda-spark-cosmosdb-batch", 
  "executorMemory": "2G", 
  "executorCores": 2, 
  "numExecutors": 4, 
  "driverMemory": "10G",
  "jars": ["wasb:///example/jars/0.0.5/azure-cosmosdb-spark_2.1.0_2.11-0.0.5-SNAPSHOT.jar", "wasb:///example/jars/0.0.5/azure-documentdb-1.13.0.jar", "wasb:///example/jars/0.0.5/azure-documentdb-rx-0.9.0-rc2.jar", "wasb:///example/jars/0.0.5/json-20140107.jar", "wasb:///example/jars/0.0.5/rxjava-1.3.0.jar", "wasb:///example/jars/0.0.5/rxnetty-0.4.20.jar"],  
  "conf": {
    "spark.jars.excludes": "org.scala-lang:scala-reflect"
   }
}
Current session configs: {u'kind': 'pyspark', u'name': u'lambda-spark-cosmosdb-batch', u'driverMemory': u'10G', u'numExecutors': 4, u'conf': {u'spark.jars.excludes': u'org.scala-lang:scala-reflect'}, u'executorCores': 2, u'jars': [u'wasb:///example/jars/0.0.5/azure-cosmosdb-spark_2.1.0_2.11-0.0.5-SNAPSHOT.jar', u'wasb:///example/jars/0.0.5/azure-documentdb-1.13.0.jar', u'wasb:///example/jars/0.0.5/azure-documentdb-rx-0.9.0-rc2.jar', u'wasb:///example/jars/0.0.5/json-20140107.jar', u'wasb:///example/jars/0.0.5/rxjava-1.3.0.jar', u'wasb:///example/jars/0.0.5/rxnetty-0.4.20.jar'], u'executorMemory': u'2G'}
IDYARN Application IDKindStateSpark UIDriver logCurrent session?
11application_1508803972461_0011sparkdeadLink
16application_1508803972461_0016pysparkdeadLink
20application_1508803972461_0020pysparkdeadLink

Configure Connection to Cosmos DB

The parameters below connect to the Cosmos DB.

In [2]:
database = "seahawks"
collection = "tweets"

tweetsConfig = {
"Endpoint" : "https://rainier.documents.azure.com:443/",
"Masterkey" : "lMdWMSAeooleaB5lRS1GhyoKly9lz8Q93kzo5GHBPyojroJilQD9PvK2qsQxh0n9uldn3ZULttsrRTGJL7u1lA==",
"Database" : database,
"Collection" : collection, 
"preferredRegions" : "Central US",
"SamplingRatio" : "1.0",
"schema_samplesize" : "200000",
"query_custom" : "SELECT c.id, c.created_at, c.user.screen_name, c.user.location, c.text, c.retweet_count, c.entities.hashtags, c.entities.user_mentions, c.favorited, c.source FROM c"
}
Starting Spark application
IDYARN Application IDKindStateSpark UIDriver logCurrent session?
22application_1508803972461_0022pysparkidleLinkLink✔
SparkSession available as 'spark'.
In [3]:
tweets = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**tweetsConfig).load()

Create tweets TempView

This way we can run SQL statements within the notebook

In [4]:
tweets.createOrReplaceTempView("tweets")
tweets.cache()
DataFrame[created_at: string, favorited: boolean, hashtags: array<struct<indices:array<int>,text:string>>, id: string, location: string, retweet_count: int, screen_name: string, source: string, text: string, user_mentions: array<struct<id:double,id_str:string,indices:array<int>,name:string,screen_name:string>>]
In [5]:
%%sql
select count(1) from tweets
count(1)
0 7098

Show various attributes of the first 20 tweets

In [6]:
%%sql
select 
  id,
  created_at,
  screen_name,
  location,
  text,
  retweet_count,
  hashtags,
  user_mentions,
  favorited,
  source
from tweets
limit 20
id created_at screen_name location text retweet_count hashtags user_mentions favorited source
0 923997284907364353 2017-10-27 19:38:14 Digitally_Your United States RT @DollyRay1987: Privacy Concerns about the M... 0 [{u'indices': [90, 94], u'text': u'IoT'}, {u'i... [{u'id': 199583609.0, u'id_str': u'199583609',... False <a href="http://twitter.com" rel="nofollow">Tw...
1 923997291521851393 2017-10-27 19:38:15 Uzado_ Vaughan, Ontario RT @ParallelRecruit: 2017 #Cyber #Threat #Defe... 0 [{u'indices': [26, 32], u'text': u'Cyber'}, {u... [{u'id': 429008970.0, u'id_str': u'429008970',... False <a href="http://twitter.com" rel="nofollow">Tw...
2 923997350401470464 2017-10-27 19:38:30 Damaris_Goddrie United States RT @SachinLulla: Bots don't need personas, the... 0 [{u'indices': [108, 112], u'text': u'IoT'}, {u... [{u'id': 359009458.0, u'id_str': u'359009458',... False <a href="http://twitter.com" rel="nofollow">Tw...
3 923997371532537858 2017-10-27 19:38:35 IoTimelab Aix-en-Provence, France RT @eLearning_Feed: DevOps: How To Use Terrafo... 0 [{u'indices': [106, 110], u'text': u'iot'}, {u... [{u'id': 2902883435.0, u'id_str': u'2902883435... False <a href="https://timelab.io" rel="nofollow">Bo...
4 923997377752690698 2017-10-27 19:38:36 ProductiveSys null RT @ProductiveSys: #CyberRisk &amp; #IoT Imple... 0 [{u'indices': [19, 29], u'text': u'CyberRisk'}... [{u'id': 8.83450410837e+17, u'id_str': u'88345... False <a href="http://twitter.com/#!/download/ipad" ...
5 923997396455092224 2017-10-27 19:38:41 WIOMAX_VA United States RT @Yahyaalireal: What Does a Trump Election M... 0 [{u'indices': [53, 57], u'text': u'IoT'}, {u'i... [{u'id': 2744610402.0, u'id_str': u'2744610402... False <a href="http://www.wiomax.com/IoT" rel="nofol...
6 923997406592753664 2017-10-27 19:38:43 splicemachine San Francisco RT @mzweben: New .@SpliceMachine Spark Adapter... 0 [{u'indices': [66, 82], u'text': u'MachineLear... [{u'id': 2758621.0, u'id_str': u'2758621', u'i... False <a href="http://twitter.com" rel="nofollow">Tw...
7 923997408262008833 2017-10-27 19:38:43 jackcrumbly null RT @bobehayes: Top 15 #Analytics and #DataScie... 0 [{u'indices': [22, 32], u'text': u'Analytics'}... [{u'id': 20748873.0, u'id_str': u'20748873', u... False <a href="http://twitter.com/download/android" ...
8 923997412489703424 2017-10-27 19:38:44 LisaGanganeed Lombard, Illinois RT @DeepLearn007: #AI can build America's New ... 0 [{u'indices': [18, 21], u'text': u'AI'}, {u'in... [{u'id': 841437061.0, u'id_str': u'841437061',... False <a href="http://twitter.com" rel="nofollow">Tw...
9 923997413358026752 2017-10-27 19:38:45 RialtoBoard Rialto,CA RT @EdTech_K12: #AI in #k12 is on the increase... 0 [{u'indices': [16, 19], u'text': u'AI'}, {u'in... [{u'id': 67700256.0, u'id_str': u'67700256', u... False <a href="https://about.twitter.com/products/tw...
10 923997418630156288 2017-10-27 19:38:46 ProductiveSys null RT @ProductiveSys: Most Disruptive Tech Trends... 0 [{u'indices': [71, 79], u'text': u'BigData'}, ... [{u'id': 8.83450410837e+17, u'id_str': u'88345... False <a href="http://twitter.com/#!/download/ipad" ...
11 923997453015281670 2017-10-27 19:38:54 OpenDataMonkey The Netherlands RT @4orgexcellence: Put #bigdata to use in #g... 0 [{u'indices': [24, 32], u'text': u'bigdata'}, ... [{u'id': 1074139603.0, u'id_str': u'1074139603... False <a href="http://www.innonational.com" rel="nof...
12 923997479028137984 2017-10-27 19:39:00 NatureGeekRobin United States RT @SachinLulla: Bots don't need personas, the... 0 [{u'indices': [108, 112], u'text': u'IoT'}, {u... [{u'id': 359009458.0, u'id_str': u'359009458',... False <a href="http://twitter.com" rel="nofollow">Tw...
13 923997488599613440 2017-10-27 19:39:02 Theresa21Young United States RT @SachinLulla: Bots don't need personas, the... 0 [{u'indices': [108, 112], u'text': u'IoT'}, {u... [{u'id': 359009458.0, u'id_str': u'359009458',... False <a href="http://twitter.com" rel="nofollow">Tw...
14 923997550113308672 2017-10-27 19:39:17 maggsyoung94 United States RT @SachinLulla: Bots don't need personas, the... 0 [{u'indices': [108, 112], u'text': u'IoT'}, {u... [{u'id': 359009458.0, u'id_str': u'359009458',... False <a href="http://twitter.com" rel="nofollow">Tw...
15 923997566416465920 2017-10-27 19:39:21 BaroqueGuitar United States RT @SachinLulla: Bots don't need personas, the... 0 [{u'indices': [108, 112], u'text': u'IoT'}, {u... [{u'id': 359009458.0, u'id_str': u'359009458',... False <a href="http://twitter.com" rel="nofollow">Tw...
16 923997569725947910 2017-10-27 19:39:22 GitHubIT Italy RT @franckmercado: Machine learning library fo... 0 [{u'indices': [88, 104], u'text': u'machinelea... [{u'id': 14915984.0, u'id_str': u'14915984', u... False <a href="https://github.com/mojtabatmj/opensrc...
17 923997602504429568 2017-10-27 19:39:30 abhisheksinha_ Mumbai RT @giactorlai: More #MachineLearning for #qu... 0 [{u'indices': [21, 37], u'text': u'MachineLear... [{u'id': 4401335380.0, u'id_str': u'4401335380... False <a href="https://ifttt.com" rel="nofollow">IFT...
18 923997616450420736 2017-10-27 19:39:33 DorothyClark86 San Diego, CA RT @SachinLulla: Bots don't need personas, the... 0 [{u'indices': [108, 112], u'text': u'IoT'}, {u... [{u'id': 359009458.0, u'id_str': u'359009458',... False <a href="http://twitter.com" rel="nofollow">Tw...
19 923997647396065280 2017-10-27 19:39:40 abhisheksinha_ Mumbai RT @tunguz: Practical #MachineLearning: Innova... 0 [{u'indices': [22, 38], u'text': u'MachineLear... [{u'id': 23511272.0, u'id_str': u'23511272', u... False <a href="https://ifttt.com" rel="nofollow">IFT...

Determine Top 10 hashtags for the tweets

In [7]:
%%sql
select concat(concat((dense_rank() OVER (PARTITION BY 1 ORDER BY tweets DESC)-1), '. '), text) as hashtags, tweets
from (
select hashtags.text, count(distinct id) as tweets
from (
select 
  explode(hashtags) as hashtags,
  id
from tweets
) a
group by hashtags.text
order by tweets desc
limit 10
) b
In [8]:
%%sql
select hashtags.text, count(distinct id) as tweets
from (
select 
  explode(hashtags) as hashtags,
  id
from tweets
) a
where hashtags.text like '%sql%'
group by hashtags.text
order by tweets desc
text tweets
0 sqlpass 10
1 sqlserver 6
2 sql 6
3 nosql 3
4 sqlsatportugal 1
In [ ]: